Red Wine Exploration by Oliver A. Weigand

This report explores a tidy dataset that contains almost 1,600 red wines with 11 variables on the chemical properties of the wine.

Some data wrangling was done to remove the X column from the dataset and to create the quality.factor variable.

Univariate Plots Section

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.factor      : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Our dataset consists of 13 variables and 1,599 observations.

##  int [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Quality was determined by a minimum of 3 wine experts, who rated each wine between 0 (very bad) and 10 (excellent).

Quality has an approximately normal distribution. Note that the values are integers and that the minimum and maximum values are 3 and 8, respectively.

A density plot is provided on the right to better visualize the distribution of the data. A density plot is a smoothed variant of the histogram: it uses kernel smoothing to plot values, which smooths out the noise and produces a smoother distribution. The peaks of the curve show where values are concentrated over the interval.
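As a rough illustration of the kernel smoothing described above, here is a minimal base R sketch. The ratings are synthetic (drawn with made-up probabilities, not taken from the wine data), and the bandwidth `bw = 0.4` is an arbitrary choice for this example.

```r
# Synthetic stand-in for the quality ratings (NOT the real wine data).
set.seed(42)
ratings <- sample(3:8, size = 500, replace = TRUE,
                  prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01))

# Kernel density estimate with a Gaussian kernel (the default).
# The bandwidth `bw` controls how much the curve is smoothed.
dens <- density(ratings, bw = 0.4)

# Location of the highest peak of the smoothed curve.
peak <- dens$x[which.max(dens$y)]
print(round(peak, 1))

# Draw the histogram on a density scale and overlay the smoothed curve.
hist(ratings, breaks = seq(2.5, 8.5, by = 1), freq = FALSE,
     main = "Synthetic quality ratings", xlab = "quality")
lines(dens, lwd = 2)
```

A larger bandwidth would merge neighbouring peaks, while a smaller one would follow the integer ratings more closely.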

##  fixed.acidity   volatile.acidity  citric.acid   
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000

Fixed acidity has an approximately normal distribution, while volatile acidity has a bimodal distribution. Citric acid appears to be multimodal with 3 peaks and a tail to the right. All plots have outliers.

Most acids in wine are fixed, or nonvolatile, acids and are quantified as fixed acidity, which is measured as tartaric acid. Volatile acidity is the amount of acetic acid in wine, which can lead to an unpleasant, vinegar taste at high levels. Citric acid, found in small quantities, can add ‘freshness’ and flavor to wines. All measurements are in g / dm^3.

##  free.sulfur.dioxide total.sulfur.dioxide   sulphates     
##  Min.   : 1.00       Min.   :  6.00       Min.   :0.3300  
##  1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.5500  
##  Median :14.00       Median : 38.00       Median :0.6200  
##  Mean   :15.87       Mean   : 46.47       Mean   :0.6581  
##  3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.7300  
##  Max.   :72.00       Max.   :289.00       Max.   :2.0000

Free sulfur dioxide, total sulfur dioxide, and sulphates all have right-skewed distributions (a single peak with a long tail to the right) and outliers.

Free sulfur dioxide measures the free form of SO2 that exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion. It prevents microbial growth and the oxidation of wine. Total sulfur dioxide measures the total amount of free and bound forms of SO2. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. Sulphates are a wine additive that can contribute to SO2 levels; they act as an antimicrobial and antioxidant. Free and total sulfur dioxide are measured in mg / dm^3, and sulphates (as potassium sulphate) in g / dm^3.

##  residual.sugar     chlorides      
##  Min.   : 0.900   Min.   :0.01200  
##  1st Qu.: 1.900   1st Qu.:0.07000  
##  Median : 2.200   Median :0.07900  
##  Mean   : 2.539   Mean   :0.08747  
##  3rd Qu.: 2.600   3rd Qu.:0.09000  
##  Max.   :15.500   Max.   :0.61100

Both residual sugar and chlorides have right-skewed distributions with outliers and long tails to the right.

Residual sugar is the amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram / liter, and wines with more than 45 grams / liter are considered sweet. Chlorides measure the amount of salt (sodium chloride) in the wine. All measurements are in g / dm^3.

##     density             pH       
##  Min.   :0.9901   Min.   :2.740  
##  1st Qu.:0.9956   1st Qu.:3.210  
##  Median :0.9968   Median :3.310  
##  Mean   :0.9967   Mean   :3.311  
##  3rd Qu.:0.9978   3rd Qu.:3.400  
##  Max.   :1.0037   Max.   :4.010

Both density and pH have approximately normal distributions with a few outliers.

Density measures the density of the wine. All observations should be very close to 1 g / cm^3, the density of water. Variations are present due to differences in the percent alcohol and sugar content. pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Alcohol has a right-skewed distribution, with a single peak and a tail to the right.

Alcohol measures the percent alcohol content of the wine, in % by volume.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wines in the dataset with 11 input variables based on physicochemical tests (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol), and 1 output variable based on sensory data (quality). All variables are numeric except quality, which is an integer.

Other Observations:

  • Most wines have a quality rating of 5 or 6.
  • Most wines have less than 4 grams / liter of residual sugar, far below the 45 grams / liter at which a wine is considered sweet.
  • Half of the wines fall between about 3.2 and 3.4 on the pH scale (the interquartile range).
  • Half of the wines have an alcohol content between 9.5 and 11.1% (the interquartile range).

What is / are the main feature(s) of interest in your dataset?

The main feature in the dataset is quality. I’d like to determine which features are best for predicting the quality of red wines. I suspect that some combination of the input variables can be used to build a predictive model for quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Citric acid, residual sugar, chlorides, total sulfur dioxide, and alcohol will likely contribute to the quality of red wine.

Did you create any new variables from existing variables in the dataset?

Yes, I created a new variable called quality.factor, which is a factor with 6 levels ("3", "4", "5", "6", "7", "8") derived from the quality variable. I decided to keep both quality and quality.factor because it gives me greater flexibility and saves time from having to convert from a factor to an integer and back throughout the project.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

There was one unusual distribution, citric acid, which was multimodal with 3 peaks. The dataset was already tidy, so there was little need for any data wrangling. However, I did remove the variable X, as it was not needed for this exploration, and created quality.factor for convenience.
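The two wrangling steps described above could look roughly like the following sketch. The data frame `wine_raw` and its values are a made-up stand-in for the raw file, and the exact calls used in the original analysis are not shown in this report, so this is only one plausible way to do it.

```r
# Toy data frame standing in for the raw file (values are illustrative only).
wine_raw <- data.frame(
  X       = 1:6,                                  # row-index column
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.2, 12.8),
  quality = c(5L, 5L, 6L, 7L, 4L, 8L)
)

# Drop the index column X, which carries no chemical information.
wine <- subset(wine_raw, select = -X)

# Keep the integer quality and add a factor copy with the 6 rating levels.
wine$quality.factor <- factor(wine$quality, levels = 3:8)

str(wine)

# Converting the factor back to integers goes through the level labels.
quality_back <- as.integer(as.character(wine$quality.factor))
```

Keeping both columns avoids repeating the last conversion every time a numeric quality is needed, for example in a correlation test.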

Bivariate Plots Section

Matrix of plots with wine dataset.

Plot of correlation values between quality and input variables.

Plots with main output variable.

These plots consider each input variable against quality.

Fixed acidity

## 
##  Pearson's product-moment correlation
## 
## data:  quality and fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

On the left is a histogram of fixed acidity faceted by quality, while on the right is a boxplot combined with a jittered scatter plot of quality and fixed acidity. The plots are followed by the output of Pearson’s product-moment correlation test. The correlation appears at the very bottom of the output, under sample estimates (cor 0.1240516). Note that the significance level is set to 0.05 (5%).
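To make the pieces of this output easier to read, here is a small sketch of how the same numbers can be pulled out of a `cor.test()` result. The vectors `x` and `y` are synthetic and only serve to produce an output object; the last three lines recompute the t statistic from the fixed acidity correlation and degrees of freedom reported above.

```r
# Synthetic data (NOT the wine data): y depends weakly on x.
set.seed(1)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

ct <- cor.test(x, y, method = "pearson")

r  <- unname(ct$estimate)   # the "cor" value under "sample estimates"
p  <- ct$p.value            # compared against the 0.05 significance level
ci <- ct$conf.int           # the 95 percent confidence interval
cat("cor =", round(r, 3), " significant at 0.05:", p < 0.05, "\n")

# The t statistic follows from r and the degrees of freedom (n - 2).
# The two values below are copied from the fixed acidity output.
r_fixed <- 0.1240516
df      <- 1597
t_fixed <- r_fixed * sqrt(df) / sqrt(1 - r_fixed^2)
print(round(t_fixed, 3))
```

With 1,599 wines, even a small correlation produces a large t statistic, which is why weak correlations in this section can still be statistically significant.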

Fixed acidity has a weak positive correlation with quality.

Volatile acidity

## 
##  Pearson's product-moment correlation
## 
## data:  quality and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

Volatile acidity has a moderate negative correlation with quality.

Citric acid

## 
##  Pearson's product-moment correlation
## 
## data:  quality and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

Citric acid has a weak positive correlation with quality.

Free sulfur dioxide

## 
##  Pearson's product-moment correlation
## 
## data:  quality and free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606

Free sulfur dioxide has a weak negative correlation with quality.

Total sulfur dioxide

## 
##  Pearson's product-moment correlation
## 
## data:  quality and total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

Total sulfur dioxide has a weak negative correlation with quality.

Sulphates

## 
##  Pearson's product-moment correlation
## 
## data:  quality and sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

Sulphates has a weak positive correlation with quality.

Residual sugar

## 
##  Pearson's product-moment correlation
## 
## data:  quality and residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164

Residual sugar does not have a statistically significant correlation with quality.

Chlorides

## 
##  Pearson's product-moment correlation
## 
## data:  quality and chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

Chlorides has a weak negative correlation with quality.

Density

## 
##  Pearson's product-moment correlation
## 
## data:  quality and density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

Density has a weak negative correlation with quality.

pH

## 
##  Pearson's product-moment correlation
## 
## data:  quality and pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

pH has a weak negative correlation with quality.

Alcohol

## 
##  Pearson's product-moment correlation
## 
## data:  quality and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

Alcohol has a moderate positive correlation with quality.

Plots without main output variable.

Volatile acidity and citric acid

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

Volatile acidity and citric acid have a moderate negative correlation.

Volatile acidity and sulphates

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and sulphates
## t = -10.804, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3060917 -0.2147125
## sample estimates:
##        cor 
## -0.2609867

Volatile acidity and sulphates have a weak negative correlation.

Volatile acidity and alcohol

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2488416 -0.1548020
## sample estimates:
##       cor 
## -0.202288

Volatile acidity and alcohol have a weak negative correlation.

Citric acid and sulphates

## 
##  Pearson's product-moment correlation
## 
## data:  citric.acid and sulphates
## t = 13.159, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2678558 0.3563278
## sample estimates:
##     cor 
## 0.31277

Citric acid and sulphates have a moderate positive correlation.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Residual sugar was the only input not to have a significant correlation with the main feature (quality). The remaining 10 inputs are listed in descending order of correlation strength:

  • Alcohol
  • Volatile Acidity
  • Sulphates
  • Citric Acid
  • Total Sulfur Dioxide
  • Density
  • Chlorides
  • Fixed Acidity
  • pH
  • Free Sulfur Dioxide

Of all 10, alcohol had the strongest correlation (0.48), while free sulfur dioxide had the weakest significant correlation (-0.051). Also of note is the lack of wines with between 0.125 and 0.25 g / dm^3 of citric acid at quality levels 7 and 8 only. This may be caused by different categories of high-quality red wine purposefully having higher or lower levels of citric acid. As you may recall, citric acid can add ‘freshness’ and flavor to wine. Another interesting point is that most high-quality red wines (quality of 7 or 8) seem to have sulphate levels between 0.65 and 0.82 g / dm^3.
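The ranking above can be reproduced from the correlation values printed in the Bivariate Plots Section. The sketch below copies those values by hand and recomputes each two-sided p-value from the correlation and the degrees of freedom (1597); it is a cross-check on the reported numbers, not the code used to build the original plot.

```r
# Correlations with quality, copied from the cor.test outputs in this report.
r <- c(
  fixed.acidity        =  0.1240516,
  volatile.acidity     = -0.3905578,
  citric.acid          =  0.2263725,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.1851003,
  sulphates            =  0.2513971,
  residual.sugar       =  0.01373164,
  chlorides            = -0.1289066,
  density              = -0.1749192,
  pH                   = -0.05773139,
  alcohol              =  0.4761663
)
df <- 1597  # n - 2, with n = 1599 wines

# Two-sided p-value of each correlation from its t statistic.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df)

# Keep correlations significant at 0.05 and sort by absolute strength.
significant <- r[p_val < 0.05]
ranked <- significant[order(abs(significant), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting by absolute value matters here because volatile acidity is negatively correlated with quality yet ranks second in strength.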

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

There were 4 interesting relationships. Two had weak negative correlations: volatile acidity and sulphates, and volatile acidity and alcohol. Citric acid and sulphates had a moderate positive correlation, while volatile acidity and citric acid had a moderate negative correlation.

What was the strongest relationship you found?

The strongest relationship was a negative linear relationship between volatile acidity and citric acid, with a correlation of -0.55.

Multivariate Plots Section

The plots are of volatile acidity, citric acid, and quality.

You can clearly see that higher quality wines tend to have low levels of volatile acidity, and slightly higher levels of citric acid, than lesser wines.

The plots are of volatile acidity, sulphates, and quality.

The trend of higher quality wines having less volatile acidity still holds. Furthermore, you can also see that sulphate levels tend to be higher in the better-quality wines.

The plots are of alcohol, volatile acidity, and quality.

The plot continues to show the trend of higher quality wines having less volatile acidity, but it also shows that those same wines tend to have higher alcohol levels.

These plots are of citric acid, sulphates, and quality.

The trend presented in this plot is that higher quality wines tend to have more sulphates present than lower quality wines.

## 
## Calls:
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid, data = wine)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide + density, data = wine)
## m8: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide + density + chlorides + 
##     fixed.acidity, data = wine)
## m10: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + total.sulfur.dioxide + density + chlorides + 
##     fixed.acidity + pH + free.sulfur.dioxide, data = wine)
## 
## ==============================================================================================
##                              m2            m4            m6            m8           m10       
## ----------------------------------------------------------------------------------------------
##   (Intercept)               3.095***      2.646***     -7.009        28.165         8.223     
##                            (0.184)       (0.201)      (11.972)      (15.083)      (17.026)    
##   alcohol                   0.314***      0.309***      0.305***      0.268***      0.291***  
##                            (0.016)       (0.016)       (0.020)       (0.021)       (0.023)    
##   volatile.acidity         -1.384***     -1.265***     -1.247***     -1.137***     -1.087***  
##                            (0.095)       (0.113)       (0.116)       (0.120)       (0.121)    
##   sulphates                               0.696***      0.710***      0.916***      0.891***  
##                                          (0.103)       (0.104)       (0.112)       (0.112)    
##   citric.acid                            -0.079        -0.093        -0.198        -0.174     
##                                          (0.104)       (0.120)       (0.145)       (0.147)    
##   total.sulfur.dioxide                                 -0.002***     -0.002***     -0.003***  
##                                                        (0.001)       (0.001)       (0.001)    
##   density                                               9.820       -25.583        -3.864     
##                                                       (11.931)      (15.122)      (17.385)    
##   chlorides                                                          -1.584***     -1.879***  
##                                                                      (0.408)       (0.419)    
##   fixed.acidity                                                       0.055**       0.013     
##                                                                      (0.017)       (0.023)    
##   pH                                                                               -0.482**   
##                                                                                    (0.181)    
##   free.sulfur.dioxide                                                               0.005*    
##                                                                                    (0.002)    
## ----------------------------------------------------------------------------------------------
##   R-squared                 0.317         0.336         0.344         0.356         0.360     
##   adj. R-squared            0.316         0.334         0.342         0.353         0.356     
##   sigma                     0.668         0.659         0.655         0.650         0.648     
##   F                       370.379       201.777       139.219       109.751        89.354     
##   p                         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -1621.814     -1599.093     -1589.409     -1575.112     -1569.735     
##   Deviance                711.796       691.852       683.523       671.409       666.908     
##   AIC                    3251.628      3210.186      3194.818      3170.224      3163.470     
##   BIC                    3273.136      3242.448      3237.835      3223.995      3227.996     
##   N                      1599          1599          1599          1599          1599         
## ==============================================================================================

Linear model utilizing 10 out of the 11 input variables.
Model 10 (m10) is the final model, with the highest R-squared and the lowest AIC value.

R-squared tells us the proportion of variation in the dependent variable that has been explained by the model. Typically, we want to see an R-squared of 0.7 or greater, but we don’t necessarily discard a model based on a low R-squared value. It’s better practice to look at the AIC and the prediction accuracy on a validation sample when deciding on the efficacy of a model. Akaike’s information criterion, or AIC (Akaike, 1974), measures the relative quality of an estimated statistical model by balancing goodness of fit against the number of parameters, and can be used for model selection. The lower the AIC value, the better.
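As a worked check of how AIC trades fit against model size, the AIC and BIC rows of the table can be recomputed from the log-likelihood row. The sketch below copies the log-likelihoods by hand and assumes each model's parameter count is its number of coefficients (intercept included) plus one for the residual variance, which is the convention R uses for `lm` fits.

```r
n <- 1599  # number of wines

# Log-likelihoods copied from the model comparison table.
loglik <- c(m2 = -1621.814, m4 = -1599.093, m6 = -1589.409,
            m8 = -1575.112, m10 = -1569.735)

# Coefficients per model (intercept + predictors), plus 1 for sigma.
n_coef <- c(m2 = 3, m4 = 5, m6 = 7, m8 = 9, m10 = 11)
k <- n_coef + 1

aic <- -2 * loglik + 2 * k        # penalty of 2 per parameter
bic <- -2 * loglik + log(n) * k   # penalty of log(n) per parameter

print(round(rbind(AIC = aic, BIC = bic), 3))
cat("lowest AIC:", names(which.min(aic)),
    " lowest BIC:", names(which.min(bic)), "\n")
```

Because BIC penalizes each extra parameter more heavily than AIC for a sample of this size, the two criteria do not have to agree; in the table, the BIC row is lowest for m8 rather than m10.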

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

A few relationships observed were:

  • Higher quality wines tended to have lower levels of volatile acidity.
  • Higher quality wines tended to have slightly higher levels of citric acid present.
  • Higher quality wines tended to have more alcoholic content.
  • Higher quality wines tended to have slightly higher levels of sulphates present.

When comparing volatile acidity, alcohol, and quality, it became evident that higher quality wines tend to have both higher alcohol content and less volatile acidity.

Were there any interesting or surprising interactions between features?

One surprising finding was that higher quality wines tended to have slightly higher levels of sulphates, which is a salt (potassium sulphate). This is surprising because it runs against the preconception that saltier wines would be of lower quality. That being said, by more salt, we are talking about a tenth (0.1) of a g / dm^3, which is a very small amount.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

We did create a linear model. The best-fitting model had an R-squared value of 0.360 and an AIC value of 3163. The largest limitation this model faces is that quality was determined by taste testers. While expert taste testers know a lot more about wine quality than I do, this still introduces an element of human error. Furthermore, quality in and of itself may not directly relate to an enjoyable wine for the average consumer.

This model is statistically significant, with a p-value less than 0.05. While we would prefer a higher R-squared value, model 10 (m10) has the highest value of the models compared.


Final Plots and Summary

Plot One

Description One

This plot is of the correlation values between quality and all of the input variables.

It was chosen as one of the 3 final plots because it enables the reader to grasp the full range of correlation values between quality and the input variables. A correlation is a statistical measurement that suggests the level of linear dependence between 2 variables. In other words, we can use it as a rough measurement of which variables we should pay the most attention to (such as alcohol) and which variables will probably be of little importance (such as residual sugar). These values can range from -1 to +1, and the closer the value is to 0, the weaker the correlation is. Furthermore, we consider any correlation with a p-value greater than 0.05 to be statistically insignificant.

Plot Two

Description Two

This is a jitter plot of alcohol content by quality level with a boxplot and linear regression line overlay.

It was chosen because it clearly presents the relationship between alcohol and quality, the strongest correlation of all the input variables. The jitter plot is a scatter plot that adds a small amount of random noise to the discrete positions (quality), which helps to avoid overplotting and shows concentrations more clearly. The box plot is a standardized way of displaying the distribution of the data, and the linear regression line shows the linear trend whose strength the correlation value measures.
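For readers who want to see how the three layers fit together, here is a base R sketch of a boxplot with jittered points and a regression line. The `quality` and `alcohol` vectors are simulated with an invented positive trend, and the original figure may well have been built differently (for example with ggplot2), so treat this as an illustration of the technique only.

```r
# Simulated data (NOT the wine data) with a built-in positive trend.
set.seed(7)
quality <- sample(3:8, 300, replace = TRUE,
                  prob = c(0.02, 0.05, 0.40, 0.38, 0.13, 0.02))
alcohol <- 8.5 + 0.35 * quality + rnorm(300, sd = 0.8)

# Boxes are drawn at x = 1..6, one per quality level 3..8.
qf <- factor(quality, levels = 3:8)
boxplot(alcohol ~ qf, outline = FALSE,
        xlab = "quality", ylab = "alcohol (% by volume)")

# Jitter only the discrete x positions so that points do not overlap.
x_pos <- as.integer(qf)
points(jitter(x_pos, amount = 0.25), alcohol,
       pch = 16, col = rgb(0, 0, 1, 0.3))

# Regression line fitted on the same 1..6 positions used by the boxes.
fit <- lm(alcohol ~ x_pos)
abline(fit, col = "red", lwd = 2)
```

The jitter is applied to the plotting positions only; the regression is fitted on the unjittered values, so the added noise does not affect the line.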

Plot Three

Description Three

This is a jitter and smooth line plot of volatile acidity, alcohol, and quality.

It was chosen because it allows us to see the multivariable trends between volatile acidity, alcohol, and quality. The jitter plot allows us to see the concentrations and quality level, thus enabling us to visually determine if any trends are present. The smooth line plot is another tool that is very effective in allowing us to see what the trend actually looks like. However, due to its nature, it can overgeneralize, although that can be mitigated through the use of confidence bands.


Reflection

The Red Wine Quality dataset was a tidy dataset with almost 1,600 wines, 11 physicochemical (input) variables, and 1 sensory (output) variable. The majority of the wines were of a normal quality, with very few being considered of high or low quality. This is the greatest limitation in the dataset because it made it much more difficult to identify chemical trends that could be used to predict high- or low-quality wines. The majority of the wines were not considered sweet, were considered acidic (pH between 3.2 and 3.4), and had an alcohol content between 9.5 and 11.1%. The majority of the input variables had weak correlations with quality, with the strongest correlation being alcohol, followed by volatile acidity. Residual sugar was the only variable to have a statistically insignificant correlation with quality. It was found that higher quality wines were more likely to have higher alcohol content and lower volatile acidity levels. Furthermore, higher quality wines also tended to have slightly higher citric acid and sulphate levels. Lastly, when comparing alcohol content, volatile acidity, and quality, you can clearly see the concentration of higher quality wines. All of these findings indicate that alcohol and volatile acidity would be the best 2 indicators of wine quality, a finding supported by the linear regression model, in which both variables were highly significant.

When starting this analysis, my first thought was that it would be easy. Primarily, I would only need to consider correlations to quality, make some nice plots, and submit. What I was not expecting was the vast number of low correlations to quality, and the intercorrelation between input variables, such as free and total sulfur dioxide. This called for a more in-depth exploration of the data. The box plots and linear regression line overlays worked well for viewing this behavior. Furthermore, the smooth lines in the multivariable plots helped to spot trends. However, that brings me to the last issue I experienced: discrete variables can be rather difficult to analyze. If you collapse a discrete variable (like quality) into 2 ordered levels, you can begin to overgeneralize. However, if you do not, it may be rather hard to identify any trends.

As already mentioned, the greatest limitation was the lack of high- and low-quality wines in the dataset. However, another limitation is that quality was determined by human sensory data. While measures were taken to limit human error, by combining the quality ratings of at least 3 separate testers, it will always remain present in the dataset. The best way to improve this analysis is to obtain more data entries for high- and low-quality wines. This would balance out the quality variable, allowing more accurate trends to be identified for high- and low-quality wines.